Hashing
=======

Hash Table
----------

BSTs provide add, contains, and remove operations in O(log n) time.
Can we create a structure that improves on O(log n) time?
	Yes
	Hashing can implement a Set in O(1) per operation

Suppose we need to create a set of employee ID numbers, where each
ID number is between 1 and 1000.
	How can we store the numbers so we can add/contains/remove in
	  O(1) time?
	Make an array of size 1000 and index it with the item value
		add(i)
			array[i]++
		contains(i)
			return array[i] > 0
		remove(i)
			array[i]--

Could we use this approach to store items (like Strings) that are 
not numbers?
	Yes: convert the strings into numbers
	Characters are represented by codes (commonly from 0 to 127)
	Treat a string as a base-128 number

Could we use this approach to store larger numbers? Suppose we are 
storing only 1000 different numbers, but the numbers range in value 
from 1 to 1000000. Would we make an array of size 1000000?
	No: make an array closer to size 1000
	Use a hash function to map large numbers into smaller ones

What is a Hash Table?
	An array used to store items in a collection
	Items are stored in the array at the index given by a special
	  function known as a hash function

What is a Hash Function?
	A function that maps an item to a number
	The number is in the range of a valid array index
	The function usually contains a mod by table size (x % tableSize)

What is a Collision?
	When two or more items hash to the same number
	Resolve with chaining, linear probing, quadratic probing


Hash Functions
--------------

What are the characteristics of a good Hash Function?
	Gives a number between 0 and tableSize-1
	Easy/fast to compute
	Distributes items evenly throughout the hash table

What is a typical hash function for integers?
	value % tableSize

What is a typical hash function for Strings?
	A hash function for Strings must first convert a String to 
	  a number
	
	Consider the following function for converting a String to
	  a number
		public static int hash(String key, int size) {
		    int hashVal = 0;
		    for (int i = 0; i < key.length(); i++)
		        hashVal += key.charAt(i);
		    return hashVal % size;
		}

	What is wrong with this code?
		Same characters in different orders (e.g., tab, bat) 
		  produce the same result
		Gives only small numbers (10 chars give at most 1280)


	Consider the following function for converting a String to
	  a number
		public static int hash(String key, int size) {
		    int hashVal = 0;
		    for (int i = 0; i < key.length(); i++)
		        hashVal = (hashVal * 128 + key.charAt(i)) % size;
		    return hashVal;
		}

	How is this hash function better than the previous one?
		Characters are weighted by their positions
		-> Different orders give different numbers and larger 
		  numbers can be produced


	Consider the following function for converting a String to
	  a number
		public static int hash(String key, int size) {
		    int hashVal = 0;
		    for (int i = 0; i < key.length(); i++)
		        hashVal = hashVal * 128 + key.charAt(i);
		    hashVal %= size;
		    if (hashVal < 0)
	      	  hashVal += size;
		    return hashVal;
		}

	How is this hash function better than the previous one?
		The expensive mod operation is not repeated on every loop
	What problem does this code have with long strings?
		With long strings the early characters are lost from the 
		  answer


	Consider the following function for converting a String to
	  a number
		public static int hash(String key, int size) {
		    int hashVal = 0;
		    for (int i = 0; i < key.length(); i++)
		        hashVal = hashVal * 37 + key.charAt(i);
		    hashVal %= size;
		    if (hashVal < 0)
	      	  hashVal += size;
		    return hashVal;
		}

	How is this hash function better than the previous one?
		The early characters are not lost as quickly


What is a typical hash function for Java Objects?
	The hashCode method maps the object to an integer
	item.hashCode() % tableSize

How do we implement hashCode for classes you write?
	Combine the hashcodes for the parts of the object that 
	  make up the key

	public class Vehicle {
	    int state;	    // key
	    String id;	    // key
	    String make;
	    String model;
	    int year;

	    public boolean equals(Object obj) {
		if (obj instanceof Vehicle) {
		    Vehicle other = (Vehicle)obj;
		    return state == other.state && id.equals(other.id);
		} else {
	            return false;
		}
	    }

	    public int hashCode() {
		return state + id.hashCode();
	    }
	}


Collision Resolution
--------------------

Linear Probing
	How do you insert an item in a hash table using Linear 
	  Probing?
		Sequentially scan the array until an empty cell 
		  is found
		Wrap-around from the last position to the first

	Exercise
		Show the result of inserting 2, 7, 3, 8 into a hash 
		  table of size 5 with linear probing
		Show the result of inserting 7, 6, 1, 2 into a hash 
		  table of size 5 with linear probing

	How do we find an item in a hash table using Linear Probing?
		Follow the same sequence as insert
		If we reach an empty slot, the item is not in the table

	Exercise
		Show the result of finding 8 and 9 in the first hash 
	  	  table above

	How do we delete an item from a hash table using Linear Probing?
		We must use lazy deletion
		Mark items deleted

	Analysis of Linear Probing
		Load factor lambda = fraction of the table that is full
		Given a hash table with load factor lambda, what is the 
		  probability of hitting an empty cell with one probe?
			1 - lambda
		Assuming that probes are independent (not valid), how 
		  many probes (on average) are needed to find an empty 
		  slot in the table?
			1 / (1 - lambda)
			2 probes in a half-full table
			10 probes in a 90%-full table
		Why are the probes not independent? What is Primary 
		  Clustering?
			Large blocks of occupied cells are formed
			Insertions at indexes in the cluster are expensive
		Assuming that probes are not independent, how many probes 
		  (on average) are needed to find an empty slot in the table?
			(1 + 1 / (1 - lambda)^2) / 2
			2.5 probes in a half-full table
			50 probes in a 90%-full table
		Does Linear Probing work well enough to use in practice?
			Yes
			If we keep the load factor low (lambda < 0.5), then
			  linear probing works well
			If we could remove primary clustering, we'd only save 
			  0.5 probe on insert, 0.1 probe on find
		What is a good Load Factor to use with Linear Probing?
			Keep the load factor low, lambda < 0.5

Quadratic Probing
	How do you insert an item in a hash table using Linear 
	  Probing?
		Scan cells 1, 4, 9 away from the original probe
		If the hash function gives H, scan H+(i^2) instead of H+i
		  (i = 1, 2, ...)
		Wrap-around from the last position to the first
		Eliminates primary clustering
	
	Exercise
		Show the result of inserting 2, 7, 3, 8 into a hash table 
		  of size 5 with quadratic probing
		Show the result of inserting 7, 6, 1, 2 into a hash table 
		  of size 5 with quadratic probing

	Does quadratic probing try all cells?
		If the table size is prime and the load factor is <= 0.5,
		  then probes will be to different locations

	Implementation issues
		Linear Probing is fast and easy (increment, wraparound)
		Quadratic Probing appears expensive (increment, square, 
		  add, wraparound)
		Can we implement quadratic probing efficiently?

			increment i          (i+1)
			shift left 1 bit     (2(i+1))
			subtract 1           (2(i+1) - 1)
			add to old position
			test for wrap-around

		How does adding 2(i+1)-1 to the old position give the same 
		  result as H + (i+1)^2?

			(H + (i+1)^2) - (H + (i)^2) = 2i + 1 = 2(i+1) - 1

		What should we do when the load factor exceeds 0.5?
			Rehashing
			Make a new table twice as big (but use a prime number)
			Find the next prime number at least twice as large
		Do we copy each item from the old table to the same location 
		  in the new table?
			No, the hash function is different because the table 
			  size is different
			Hash each item into the new table

		How well does Quadratic Probing perform?
			Eliminates primary clustering
			Suffers from secondary clustering
				Elements that hash to the same position probe the 
				  same alternate cells
				Causes less than an extra half-probe per search
			Can we eliminate Secondary Clustering?
				Yes, use Double Hashing
				Use a second hash function for collision resolution
					H(x) + 0*H2(x)
					H(x) + 1*H2(x)
					H(x) + 2*H2(x)
					H(x) + i*H2(x)


Chaining
--------

How do we resolve collisions using Chaining?
	The hash table is an array of linked-lists
	The hash function tells to which list an item is added and 
	  in which list an item can be found

Exercise
	Show the result of inserting 2, 7, 3, 8 into a hash table of 
	  size 5 (use the "standard" hash function for integers)

What is the Big-Oh bound of add/contains/remove when chaining is 
used?
	Worst-case is O(n) (all items hash to the same bucket)
	Average-case is O(n/m) (items distributed evenly over the 
	  m buckets)

What is the Load Factor of a hash table?
	The fraction of the table that is full
	lambda = n/m

What is a good Load Factor to use with chaining?
	1.0
	Lower does not significantly improve performance
	Higher can save space

Which Hashing or AVL trees is better?
	Trees support order
	Hashing is O(1), trees are O(logN)
	Hashing is simpler


Implementing Hashing (Project #7)
---------------------------------

You will implement the Set interface (add/remove/contains) using 
hashing

What hash function will you use?
	Math.abs(item.hashCode % hash_table_size)

How will you resolve collisions?
	Chaining

The hash table will be an array of what type?
	Array of List

How will you implement size()?
	Use an integer variable
	Increment when items are added
	Decrement when items are removed

How will you implement add()?
	Use the hash function to get the index
	Call add on the List at the index

How will you implement contains()?
	Use the hash function to get the index
	Call contains on the List at the index

How will you implement remove()?
	Use the hash function to get the index
	Call remove on the List at the index

What will be the size the hash table?
	Initially 101

When will you rehash to a larger table?
	When the load factor exceeds 1

What size table will you create when you rehash?
	2 * hash_table_size + 1

How will you rehash into the new table?
	Iterate through the old table and add to the new table

You will also implement an Iterator (hasNext/next) that 
iterates over the hash table
	What data is needed in an Iterator for the hash table?
		The current index in the array
		The current position in the List at that index
	What order must the Iterator use when returning items?
		Sequentially from index 0 to hash_table_size-1
		For each index, give the items in the order they 
		  were added
	How will you implement next()?
		Move to the next item on the list
		If at the end of the list, move to the next index
	How will you implement hasNext()?
	How will you implement toArray()?
		Make a new array of the exact size
		Iterate over the table putting items in the array

You will then implement the Map interface (put/get/remove) using 
the hash set
	What data is needed?
		A hash table object
	What's stored in a Map?
		Key/value pairs
	How do you store key/value pairs in a Set?
		Make a Pair class that contains a key and a value
		Define the equals and hashCode methods using only the key
	What's stored in a Pair object?
		A key and a value
	How do you implement equals()?
		Return the result of equals on the keys
	How do you implement hashCode()?
		Return the result of hashCode on the key
	How do you implement clear(), isEmpty(), size()?
		hash.clear()
		hash.isEmpty()
		hash.size()
	How do you implement get()?
		Add a find method to the hash table implementation
		Find returns the object that matches
		Make a pair with the key and a null value
		Call find with the pair
	How do you implement put()?
		If a pair with the same key is in the table, remove it
		Save the old value to be returned
		Make a new pair with the key and value
		Add the new pair to the hash table
	How do you implement remove()?
		Save the old value to be returned
		Make a pair with the key and a null value
		Remove the pair from the hash table